Cache-based communication between execution threads of a data processing system

ABSTRACT

A virtual link buffer provides communication between processing threads or cores. A first cache is accessible by a first processing device and a second cache is accessible by a second processing device. An interconnect structure couples the first and second caches and includes a link controller. A producer cache line in the first cache stores data produced by the first processing device, and the link controller transfers data in the producer cache line to a consumer cache line in the second cache. Each new data element is stored at a location in the producer cache line indicated by a store position, or tail, indicator that is stored at a predetermined location in the same cache line. Transferred data are loaded from a location in the consumer cache line indicated by a load position, or head, indicator that is stored at a predetermined location in the same consumer cache line.

GOVERNMENT LICENSE RIGHTS

This invention was made with Government support under the Fast Forward 2 contract awarded by DOE. The Government has certain rights in this invention.

TECHNICAL FIELD

The present disclosure relates to a hardware-accelerated directed communication channel implemented using caches in a data processing system. The communication channel has application for data transfer between execution threads in a data processing system.

BACKGROUND

Data processing systems commonly execute a number of threads. The execution threads may be performed serially on a single serial processor using time-slicing, in parallel on a number of linked processing cores, or a combination thereof. In many applications, there is a desire to pass data from one execution thread to another via a data channel. Moreover, the data may be passed in a specified pattern. For example, a first-in, first-out (FIFO) communication pattern is inherent in many applications, where data is entered sequentially into a storage medium and is removed from the storage medium in the same sequential order. Thus, the first data stored in the medium will be the first data taken out. A FIFO may be implemented explicitly as a buffer in hardware or it may be implemented in software. In other applications, the order of the data is not important, but the data is still generated by a producer and directed towards a consumer.

It is well known that processes and threads executing in a data processing system may share information through use of a common storage, either a physical storage medium or a virtual address space. However, in this kind of communication, information is not directed from one process or thread to another. Directed communication may be achieved using software in conjunction with a shared memory, but transmission of data from one thread to another consumes valuable processor resources (e.g., through locks, false sharing, etc.). These events conspire to increase latency, increase energy usage, and decrease overall performance. Similarly, transmission of data from one processing core to another requires communication through multiple layers of cache hierarchy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagrammatic representation of a first-in, first-out (FIFO) buffer.

FIG. 2 is a simplified diagram of a data processing system for implementing a virtual link buffer, in accordance with embodiments of the disclosure.

FIG. 3 is a diagrammatic representation of an example of a cache memory.

FIG. 4 is a diagrammatic representation of an address in a backing storage device, such as a main memory.

FIG. 5 is a diagram of a data block of a single cache line of a producer device, in accordance with embodiments of the disclosure.

FIG. 6 is a diagram of a data block of a single cache line of a consumer device, in accordance with embodiments of the disclosure.

FIG. 7 is a diagram of a data block of a single cache line, in accordance with further embodiments of the disclosure.

FIG. 8 is a flow chart showing operation of a ‘make_fifo’ instruction, in accordance with embodiments of the disclosure.

FIG. 9 is a flow chart showing operation of an ‘open_fifo_producer’ instruction, in accordance with embodiments of the disclosure.

FIG. 10 is a flow chart showing operation of a store instruction, in accordance with embodiments of the disclosure.

FIG. 11 is a flow chart showing operation of a load instruction, in accordance with embodiments of the disclosure.

FIG. 12 shows a link controller table, in accordance with certain embodiments.

FIG. 13 shows a virtual memory FIFO table, in accordance with certain embodiments.

FIG. 14 is a flow chart of a method of operation of a producer device for link buffer communication, in accordance with certain embodiments.

FIG. 15 is a flow chart of a method of operation of a link controller for link buffer communication, in accordance with certain embodiments.

FIG. 16 is a flow chart of a method of operation of a link controller for link buffer communication, in accordance with certain embodiments.

FIG. 17 is a diagrammatic representation of a data processing system that implements a virtual FIFO buffer, in accordance with certain embodiments.

DETAILED DESCRIPTION OF THE EMBODIMENTS

While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described. In the description below, like reference numerals may be used to describe the same, similar or corresponding parts in the several views of the drawings.

In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Reference throughout this document to “one embodiment,” “certain embodiments,” “an embodiment,” “implementation(s),” “aspect(s),” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive. Also, grammatical conjunctions are intended to express any and all disjunctive and conjunctive combinations of conjoined clauses, sentences, words, and the like, unless otherwise stated or clear from the context. Thus, the term “or” should generally be understood to mean “and/or” and so forth.

All documents mentioned herein are hereby incorporated by reference in their entirety. References to items in the singular should be understood to include items in the plural, and vice versa, unless explicitly stated otherwise or clear from the text.

Recitation of ranges of values herein are not intended to be limiting, referring instead individually to any and all values falling within the range, unless otherwise indicated, and each separate value within such a range is incorporated into the specification as if it were individually recited herein. The words “about,” “approximately,” “substantially,” or the like, when accompanying a numerical value, are to be construed as indicating a deviation as would be appreciated by one of ordinary skill in the art to operate satisfactorily for an intended purpose. Ranges of values and/or numeric values are provided herein as examples only, and do not constitute a limitation on the scope of the described embodiments. The use of any and all examples, or exemplary language (“e.g.,” “such as,” or the like) provided herein, is intended merely to better illuminate the embodiments and does not pose a limitation on the scope of the embodiments. No language in the specification should be construed as indicating any unclaimed element as essential to the practice of the embodiments.

For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The description is not to be considered as limited to the scope of the embodiments described herein.

In the following description, it is understood that terms such as “first,” “second,” “top,” “bottom,” “up,” “down,” “above,” “below,” and the like, are words of convenience and are not to be construed as limiting terms. Also, the terms apparatus and device may be used interchangeably in this text.

The various embodiments and examples of the present disclosure as presented herein are understood to be illustrative of the present disclosure and not restrictive thereof and are non-limiting with respect to the scope of the present disclosure.

Further particular and preferred aspects of the present disclosure are set out in the accompanying independent and dependent claims. Features of the dependent claims may be combined with features of the independent claims as appropriate, and in combinations other than those explicitly set out in the claims.

The present disclosure relates to a hardware-accelerated, directed communication channel, implemented using caches and a link controller, for providing a data link between execution threads in a data processing system. The communication channel provides a virtual link buffer. For example, the caches may be used to implement ordered communication of data (such as a first-in, first-out (FIFO) pattern, or a last-in, first-out (LIFO) pattern), or unordered communication.

FIG. 1 is a diagrammatic representation of a system 100 in which data is transferred between first processing core 102 and second processing core 104 using a first-in, first-out (FIFO) buffer 106. Data 108 produced by first processing core 102 is added to the tail 110 of buffer 106 and data 112 is consumed by the second processing core 104 from the head 114 of the buffer 106. Buffer 106 may be implemented in dedicated hardware.

FIG. 2 is a simplified diagram of a data processing system 200. By way of example, system 200 comprises first processing device 202 and second processing device 204 that are coupled by an interconnect structure 206. Devices 202 and 204 and interconnect structure 206 may be implemented on the same chip (as shown) or on separate chips. In general, system 200 may have any number of devices. First processing device 202 includes at least one cache 208 together with a cache controller 210. Similarly, second processing device 204 includes at least one cache 212 together with a cache controller 214. The at least one cache may be a hierarchy of caches, such as a level one (L1) cache and a level two (L2) cache. Interconnect structure 206 includes a coherence controller 216 that controls data flow between the caches 208 and 212 and a main memory 218. In this example, the main memory 218 is accessed via interconnect structure 206 and memory controller 220. In operation, the first and second processing devices both access main memory 218. To speed operation, copies of data may be held in caches 208 and 212. Coherence controller 216 ensures that data accessed by the first and second processing devices is up to date and has not been modified by another device.

A FIFO, LIFO or other communication pattern implemented in software alone would have to operate via multiple layers of cache hierarchies 208 and 212. The communication channel would have a large latency as a result of multiple cache misses and snoop operations. A software-implemented FIFO, for example, might require more than 70 instructions or micro-operations to perform a single push or pop operation.

In accordance with an embodiment of the disclosure, a virtual link buffer between execution threads is implemented using one or more cache lines in caches 208 and 212. Control of the cache lines used to implement a virtual link buffer is provided by link controller 222. Link controller 222 may be implemented in interconnect structure 206. Link controller 222 is implemented in hardware and provides hardware acceleration for direct communication between producer and consumer devices.

Link controller 222 may maintain a table 224 to track cache lines being used as link buffers. This approach provides hardware support for linked execution threads using existing cache systems. Communication is achieved that is analogous to an explicit hardware implementation. While the link controller is implemented in hardware, use is made of the existing cache and interconnect structures. Thus, communication is achieved without the cost (in terms of hardware area, static energy, etc.) of a full hardware solution.

The existing cache hierarchy provides a means to signal directly from one thread to another or from one processing core to another. Communication may take place within a single core (L1), across a shared cache (L2 or L3), or even across multiple cores when backed by a virtual memory system, or across multiple nodes when using a globally accessible addressing scheme.

In accordance with some embodiments, a data processing system is provided for implementing a virtual link buffer. The data processing system includes a first cache accessible by a first processing device, a second cache accessible by a second processing device and an interconnect structure that couples the first cache and the second cache, the interconnect structure comprising a link controller. A producer cache line in the first cache is configured to store a plurality of data elements produced by the first processing device. The link controller is configured to transfer data elements in the producer cache line to a consumer cache line in the second cache. The consumer cache line is configured to provide the plurality of data elements, produced by the first processing device, to the second processing device.

The data elements may be produced in sequence, where each data element is stored at a location in the producer cache line indicated by a store position indicator, where the store position indicator is stored at a predetermined location in the producer cache line and where a first cache controller is configured to access the store position indicator. The store position indicator may be referred to herein as a tail indicator for a queue-like data buffer or as a top indicator for a stack-like data buffer.

A second cache controller may be provided, where a data element is loaded from a location in the consumer cache line indicated by a load position or head indicator, where the load position indicator is stored at a predetermined location in the consumer cache line and where the second cache controller is configured to access the load position indicator. The load position indicator may be referred to herein as a head indicator for a queue-like data buffer or as a top indicator for a stack-like data buffer.

The producer cache line may be associated with a producer handle and the consumer cache line associated with a consumer handle. The producer handle and the consumer handle are stored in a table in a memory and are accessible by the link controller.

The link controller may be configured to buffer data elements transferred from the producer cache line to the consumer cache line in the memory and to maintain an order of buffered data elements.

The first processing device and the second processing device may be integrated with the data processing system.

A first virtual address for identifying a cache line in a cache of the producer processing device is associated with a second virtual address for identifying a cache line in a cache of the consumer processing device. In accordance with some embodiments, a virtual link buffer is provided between the producer processing device and the consumer processing device in a data processing system by storing, by the producer processing device, one or more data elements in a first cache line of the producer processing device, the first cache line identified by the first virtual address, and transferring, by a link controller in an interconnect structure that couples the producer and consumer processing devices, the one or more data elements in the first cache line to a second cache line in the cache of the consumer processing device, the second cache line identified by the second virtual address. The consumer processing device may then load the one or more data elements from the second cache line.

The data elements may be produced and consumed in sequence. In some embodiments, the producer device reads a store position indicator from a designated location in the first cache line, stores the data element at a location in the first cache line indicated by the store position indicator and updates the store position indicator.

The consumer device reads a load position indicator from a designated location in the second cache line, loads the data element at a location in the second cache line indicated by the load position indicator and updates the load position indicator.

The first virtual address may be translated to a first physical address in a storage device of the data processing system and the first cache line identified from the first physical address.

The link controller may allocate a producer handle comprising a pseudo-address for enabling the producer processing device to reference the virtual link buffer and a consumer handle comprising a pseudo-address for enabling the consumer processing device to reference the virtual link buffer. These handles may be associated with one another in a table, for example. In some embodiments, the link controller may only provide a single pseudo-address serving as a common handle to both producer and consumer.

In some embodiments, the first cache line is transferred to the link controller, stored in a line buffer in a memory of the data processing system and, at a later time, transferred from the line buffer to the second cache line of the consumer processing device. The line buffer may be a first-in, first-out (FIFO) line buffer or a first-in, last-out (FILO) line buffer, or may provide a relaxed ordering between producer and consumer. In other embodiments, the link controller may be configured via a signal to perform one of the aforementioned orderings.

The order of lines stored in the line buffer may be maintained by the link controller by accessing a memory, where the memory contains one or more of a table, a head pointer, a tail pointer, or a linked list. A coherence state of the cache lines stored in the line buffer may be maintained to enable consumption of data by more than one consumer processing device (as specified by the producer-consumer routing table).

Data elements in the first cache line may be transferred to the second cache line by the link controller receiving a request from the consumer processing device for data associated with the second virtual address allocated to the virtual link buffer and determining if one or more cache lines associated with the virtual link buffer are stored in the line buffer. When one or more cache lines associated with the virtual link buffer are stored in the line buffer, a cache line of the one or more stored cache lines is selected and the contents transferred to the consumer processing device.

In further embodiments, transferring the one or more data elements in the first cache line to the second cache line comprises the link controller receiving a request from the consumer processing device for data associated with the second virtual address allocated to the virtual link buffer, identifying the first virtual address allocated to the virtual link buffer, requesting a cache line associated with the identified first virtual address from the producer processing device and transferring a cache line received from the producer processing device to the consumer processing device.

Requesting the cache line associated with the identified first virtual address from the producer processing device may comprise requesting a cache line associated with a physical address that maps to the identified first virtual address.

After reading a store position indicator from a designated location in the first cache line, it may be determined from the store position indicator if the first cache line is full, and the first cache line may be transferred to the link controller when the first cache line is full.

Alternatively, the first cache line may be transferred from the line buffer in memory to the second cache line in response to a signal from the consumer processing device.

Alternatively, a non-full first cache line may be transferred from the producer to the consumer in response to a signal from the link controller, which is generated by a consumer signaling demand for data across the link.

The one or more data elements in the first cache line may be rearranged before transferring to the second cache line.

FIG. 3 is a diagrammatic representation of a cache memory 300. Data is stored in blocks 302 that are, at least conceptually, arranged as an array having a number W of columns and a number of lines. The lines are conceptually grouped as S sets of M lines. For ease of access, W=2^w, S=2^s and M=2^m are often selected to be powers of 2. In one example, each block 302 is a byte of data, w=6, s=12 and m=0, so that W=64, S=4096, and M=1. The location of the original data, of which blocks 302 are copies, is identified by tags 304 and by the location of the data within the array. In addition, each cache line includes one or more status bits 306. The status bits may indicate, for example, if the data in the line is valid or invalid, and permissions associated with the data. For example, status bits 306 might indicate the MESI state of the data (i.e., whether the data is Modified, Exclusive, Shared or Invalid). The tags 304 and status bits 306 are herein termed ‘metadata’.

In one embodiment, the status bits 306 include a bit that indicates the cache line is to be accessed as a virtual link buffer. In a further embodiment, the status bits 306 include a first bit that indicates if the cache line is to be accessed as a producer link buffer and a second bit that indicates if the cache line is to be accessed as a consumer link buffer. This enables the cache controller to determine how the cache line should be accessed.

The tag and data structures may be separated into two, with conceptually the same numbers of sets/ways, so a match found in a region of the tag array has a corresponding region in the data array. The data RAM may comprise multiple RAMs that can be individually accessed, so that, when a match is found in the tag array, the correct data element can be accessed.

FIG. 4 is a diagrammatic representation of an address 400 in a backing storage device, such as a main memory. The address 400 has n bits. For example, in some data processing systems, n=64. The lowest w bits 402 of the address 400 may be used as a column offset that indicates which data column of cache 300 could contain a copy of the data stored at that address in the backing memory. The next s bits 404 of the address 400 comprise a set index that indicates which set of cache lines could contain the copy of the data. The upper t bits of the address are used as a tag for the address, where t=n−s−w. When M=1, the cache is directly mapped and a copy of the data is stored in the cache if the tag matches the tag stored at the cache line indicated by the set index. When M>1, a copy of the data is stored in the cache if the tag matches the tag stored at any cache line in the set of lines indicated by the set index. If the tag does not match any of the M tags in the set, the data is known to be not in the cache.
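
As an illustration of this decomposition, the following C fragment extracts the column offset, set index and tag from a 64-bit address. It is a minimal sketch using the example values w=6 and s=12 from FIG. 3; the constant and function names are illustrative only and do not appear in the disclosure.

#include <stdint.h>

#define W_BITS 6                 /* w: log2 of bytes per line (W = 64)        */
#define S_BITS 12                /* s: log2 of the number of sets (S = 4096)  */

static inline uint64_t column_offset(uint64_t addr) {
    return addr & ((1ULL << W_BITS) - 1);             /* lowest w bits         */
}

static inline uint64_t set_index(uint64_t addr) {
    return (addr >> W_BITS) & ((1ULL << S_BITS) - 1); /* next s bits           */
}

static inline uint64_t address_tag(uint64_t addr) {
    return addr >> (W_BITS + S_BITS);                 /* upper t = n-s-w bits  */
}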

A virtual link buffer uses cache lines as the means for transporting the data that resides within the link buffer, and simultaneously for packaging elements of the link state within each cache line. As an example, a cache line used as a virtual LIFO or stack buffer may have the following structure:

typedef struct lifo_line /** 64-byte cache line example **/ {
    uint16_t /** 2 bytes, 8 bits per byte **/
        top_index /** index to read/write **/ : 6,
        ele_size  /** number of bytes     **/ : 3,
        reserved  /** extra bits          **/ : 7;
    uint8_t data[62] /** 62 bytes, 8 bits per byte **/;
} lifo_line /** 64 bytes total, or 1 cache line **/;

The seven reserved bits could be used to address larger sizes atomically (i.e., a vector) or they could be used for protections or other metadata.

As a further example, a cache line used as a virtual FIFO buffer may have the following structure:

typedef struct fifo_line /** 64-byte cache line example **/ {
    uint16_t /** 2 bytes, 8 bits per byte **/
        head_index /** index to read   **/ : 6,
        tail_index /** index to write  **/ : 6,
        ele_size   /** number of bytes **/ : 3,
        reserved   /** extra bits      **/ : 1;
    uint8_t data[62] /** 62 bytes, 8 bits per byte **/;
} fifo_line /** 64 bytes total, or 1 cache line **/;

Here, the index range can address the entire 62 bytes stored in the line, and the element size field can adjust the size of the data being addressed from a single byte through to 8 bytes.

FIG. 5 is a diagram of a data block 500 of a single cache line used as a virtual LIFO buffer or stack. In this example, the data block contains 64 bytes, but data blocks of other sizes may be used. FIG. 5 shows an example of how data in a cache line may be arranged for use as a virtual link buffer in a cache of a processing device that is a producer of data to be transferred via the virtual link buffer, or in a cache of a consumer device that receives the data. In this example, the right-most 62 bytes 508 are used to store buffer data, while the left-most two bytes are used to store metadata relating to the data in the virtual link buffer. The metadata fields in this example include a 6-bit field 502 for storing an indicator or index of the last-in, first-out element (i.e. the top of the stack) and a 3-bit field 504 indicating the length of each data element stored in the buffer. The remaining 7 bits are unused. In one embodiment, data is written to the cache line in sequential order starting with the right-most entry, which is the head (H) of the buffer. The top index, stored in field 502, indicates the number of bytes stored in the virtual link buffer and also indicates the position 510 where the next data value is to be written (or the previous data value was read). The most recently written byte is the top (T) of the buffer. In FIG. 5, data has been written to the ‘gray’ byte locations.
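
The update performed on a store to such a line can be modeled in software as shown below. This is a hedged sketch only: the lifo_line layout mirrors the structure given earlier, the helper name lifo_push and the byte-count interpretation of ele_size are assumptions for illustration, and the sketch fills the data array in ascending index order for simplicity, whereas FIG. 5 fills from the right-most entry. In the disclosure the corresponding update is performed in hardware by the cache controller.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct lifo_line {           /* layout from the stack example above       */
    uint16_t top_index : 6,          /* index of the next write (top, T)          */
             ele_size  : 3,          /* element length in bytes (assumed literal) */
             reserved  : 7;
    uint8_t  data[62];               /* buffer payload                            */
} lifo_line;

/* Push one element; returns false when the line is full and should be
   handed to the link controller. */
static bool lifo_push(lifo_line *line, const void *value) {
    unsigned size = line->ele_size;
    if (line->top_index + size > sizeof line->data)
        return false;                                      /* line full          */
    memcpy(&line->data[line->top_index], value, size);     /* store at top index */
    line->top_index = (uint16_t)(line->top_index + size);  /* update top index   */
    return true;
}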

As described above, each cache line includes a tag which normally identifies a region of memory. When a cache line is used as a virtual link buffer, it is allocated a specific address. In one embodiment, a set of addresses in virtual memory may be predefined and reserved for use with virtual link buffers.

The disclosure is described in more detail below with reference to embodiments of a FIFO communication channel (sometimes referred to as a ‘queue’). However, this is but one embodiment of the disclosure. It will be apparent to those of ordinary skill that other communication patterns, such as a LIFO pattern (sometimes referred to as a ‘stack’), or an unordered link may be implemented.

In one embodiment, FIFO handles are assigned by a link controller to enable reference to a particular virtual link buffer. The handles may be assigned in response to a specific instruction from a producer or consumer device, or a store to a reserved address. The instruction is trapped to the link controller and causes allocation of the FIFO handles. The FIFO handles are referred to herein as ‘pseudo-addresses’, since they do not correspond to a physical memory or storage address.

In a further embodiment, the FIFO handles may be allocated in software.

The producer device may write data to the virtual link buffer using a specific instruction or by writing to the buffer address. The cache controller of the producer device recognizes from the instruction or the address that the cache line is used as a virtual link buffer, writes the data to the position in the cache line indicated by the tail indicator, and then updates the tail indicator (for example, the tail index may be modified by the element size).

For each pseudo-address or handle associated with a virtual link buffer in the cache of a producer device, there is a corresponding pseudo-address or handle of a virtual link buffer that may be stored in a cache of a consumer device. This pseudo-address may be predefined or assigned by the link controller, for example.

A consumer device may read data from a virtual link buffer by issuing a custom instruction (such as a ‘pop’ instruction) or by issuing a request to load data from the buffer address. If the corresponding cache line does not exist in the cache of the consumer device, the address is passed to the interconnect structure. The link controller determines the corresponding producer buffer address and requests the associated cache line from the producer device. Data received from the producer device is forwarded to the consumer device. In this way, the consumer device obtains a copy of the virtual link buffer without a need for additional data paths in the interconnect structure.

FIG. 6 is a diagram of a data block 600 of a single cache line, and shows an example of how data in a cache line may be arranged for use as a virtual FIFO buffer in a cache of a consumer device. The tag of the cache line indicates the address of the consumer buffer. The tail indicator field 602, element size field 604 and data field 606 are copied from the cache line of the producer device. The head indicator field 608 is initially set to zero. The head indicator field indicates the head of the buffer, that is, the position 610 from which data is to be next read from the buffer, and is incremented each time data is read. When the head indicator is equal to the tail indicator, the buffer is empty and a new request must be made to the interconnect if more data is required.

Thus, the head location ‘H’ holds the data value first stored in the buffer and this datum will be the first data value taken out of the buffer. The tail location ‘T’ holds the last data value stored in the buffer and will be the last data value read out.
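
The consumer-side read described above can be modeled as in the sketch below. It reuses the fifo_line layout shown earlier; the helper name fifo_pop, the ascending index order and the byte-count interpretation of ele_size are illustrative assumptions, since in the disclosure the head indicator is read and updated by the cache controller of the consumer device.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct fifo_line {           /* layout from the FIFO example above       */
    uint16_t head_index : 6,         /* next position to read (head, H)          */
             tail_index : 6,         /* next position to write (tail, T)         */
             ele_size   : 3,         /* element length in bytes                  */
             reserved   : 1;
    uint8_t  data[62];
} fifo_line;

/* Load the element at the head; returns false when head equals tail, i.e. the
   line is empty and a new line must be requested over the interconnect. */
static bool fifo_pop(fifo_line *line, void *value) {
    if (line->head_index == line->tail_index)
        return false;                                    /* buffer empty         */
    memcpy(value, &line->data[line->head_index], line->ele_size);
    line->head_index = (uint16_t)(line->head_index + line->ele_size);
    return true;
}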

FIG. 7 is a diagram of a data block 700 of a single cache line and shows a further example of how data in a cache line may be arranged for use as a virtual FIFO buffer in a cache of a consumer device. The head indicator field 702 is copied from the tail indicator field of the corresponding producer cache line and indicates the number of valid data entries in the buffer. The element size field 704 is copied from the producer cache line. Field 706 is unused or may be reserved for additional data. In this embodiment, the valid data in the producer cache line is shifted left by the link controller during transfer, so that the tail of the buffer is at the left-most end of the 62-byte data field 708. At each read operation, the value in the head indicator field is decremented by the cache controller of the consumer processor to determine the location 710 of the next data to be read from the buffer. When the head indicator reaches zero, the virtual link buffer is empty and a new request must be made to the interconnect if more data is required.

In a still further embodiment, data in the producer cache line is reversed when transferred to the consumer device, so that the tail indicator becomes the head indicator.

In a still further embodiment, the data is transferred unchanged and all of the valid data in the consumer cache line is read in one go. In this embodiment there is no requirement to store a head indicator in the cache line.

In a still further embodiment, data in the producer cache line is transferred to the consumer device, the head indicator of field 702 is used as a valid count of data elements, and inline bits (e.g., error correction bits) are used to indicate valid offsets within the consumer cache line.

In a still further embodiment, a cache line includes a number of bits used as an error correction code (ECC). For example, one ECC bit may be allocated for each byte of data in a cache line. When a cache line is used as a virtual buffer, the ECC bits may be used to indicate if an associated byte of data is valid. In this embodiment, data bytes may be stored in any order, with the associated ECC bits in a producer cache line indicating which bytes have been written to and the ECC bits in a consumer cache line indicating which valid bytes have not been read yet. This embodiment enables data transfer in a predefined sequence or in a random or unspecified order.

In a still further embodiment, only full cache lines are transferred to the consumer device, in which case a tail index is not used by the consumer device.

In the embodiment disclosed above, the link controller maintains a table that records the producer-consumer pairs. In a further embodiment, the consumer address is encoded in the producer cache line itself. In operation, the link controller simply reads the consumer address from a cache line received from a producer, renames the line and passes the line over the coherence network to the consumer under the consumer's address. This approach eliminates the need for a look-up table in the link controller.

In one embodiment, a virtual link buffer is implemented using specific creation, destruction and push/pop instructions. In another embodiment, standard load/store instructions are used to access a reserved region of the virtual address space.

Explicit creation of a virtual link with FIFO ordering may use the following instructions:

-   mkfifo,
-   destroy_fifo,
-   open_fifo_<type>, where <type> may be ‘producer’ or ‘consumer’.

Standard load/store instructions may then be used to access the virtual FIFO buffer, once created, using the memory address; a usage sketch is given below.
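
The fragment below sketches how a producer thread and a consumer thread might use this instruction sequence. The wrapper functions make_fifo, open_fifo_producer and open_fifo_consumer are hypothetical intrinsics standing in for the instructions named above; a real implementation would expose them through compiler built-ins or inline assembly, so the sketch illustrates the calling pattern rather than a definitive API.

#include <stdint.h>

/* Hypothetical intrinsic wrappers for the instructions listed above. */
extern void make_fifo(uintptr_t *producer_handle, uintptr_t *consumer_handle);
extern volatile uint8_t *open_fifo_producer(uintptr_t producer_handle);
extern volatile uint8_t *open_fifo_consumer(uintptr_t consumer_handle);

/* Create the link once, obtaining the producer and consumer handles. */
void create_link(uintptr_t *producer_handle, uintptr_t *consumer_handle) {
    make_fifo(producer_handle, consumer_handle);
}

void producer_thread(uintptr_t producer_handle) {
    /* Returns the virtual address of the producer cache line for this link. */
    volatile uint8_t *fifo = open_fifo_producer(producer_handle);
    for (uint8_t v = 0; v < 100; v++)
        *fifo = v;         /* standard store: the cache controller places the
                              value at the tail indicator and advances it      */
}

void consumer_thread(uintptr_t consumer_handle) {
    volatile uint8_t *fifo = open_fifo_consumer(consumer_handle);
    for (int i = 0; i < 100; i++) {
        uint8_t v = *fifo; /* standard load: the cache controller reads at the
                              head indicator and advances it                   */
        (void)v;           /* consume the value                                */
    }
}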

Virtual link buffers may be used in a stream/data-flow oriented architecture. In one embodiment, all producer/consumer pseudo-address pairs are located within the same virtual memory (VM) address space. In a further embodiment, an operating system is used to map shared memory across address spaces.

Example instructions for creation of a FIFO ordered virtual link and their descriptions are listed in TABLE 1.

TABLE 1

make_fifo <r1> <r2>
    The first operand r1 gets a producer “handle” virtual address; the second operand r2 gets a consumer “handle” virtual address. The make_fifo instruction initializes the FIFO within the system; it creates two handles in the VM space (not real pointers, nor backed by memory of any kind). These virtual addresses are used as handles within the same virtual memory space to access the FIFOs created, either by the producer or the consumer. The producer and consumer handles can also be mapped using the operating system rather than an instruction recognized by the hardware, in an analogous manner to shared memory for inter-process communication. These handles would then be registered with the link controller by a privileged process (e.g., the operating system).

destroy_fifo <r1>
    Destruction can be done by any of the producers or consumers. Optionally, both handles could be used; however, within the link controller both addresses map to either side of the FIFO, so either could be used to destroy it. Accessing a destroyed FIFO may throw a fault (as an example implementation).

open_fifo_producer <r1> <r2>
    This instruction takes in two registers and is used to open a producer FIFO handle to receive an actual virtual address which can be used in conjunction with a standard store instruction. The first operand r1 is supplied as the producer FIFO handle. The second operand r2 is filled by the instruction as the address to use for stores to the FIFO. This is the cache line that will get the formatted cache line discussed above.

open_fifo_consumer <r1> <r2>
    This instruction takes in two registers and is used to open a consumer FIFO handle to receive an actual virtual address which can be used in conjunction with a standard load instruction. The first operand r1 is supplied as the consumer FIFO handle. The second operand r2 is filled by the instruction as the address to use for loads from the FIFO. This is the cache line that will get the format discussed above.

open_fifo_broadcast_producer <r1> <r2>
    This instruction takes in two registers and is used to open a producer broadcast FIFO handle to receive an actual virtual address which can be used in conjunction with a standard store instruction. The first operand r1 is supplied as the producer FIFO handle. The second operand r2 is filled by the instruction as the address to use for stores to the FIFO. This is the cache line that will get the formatted cache line discussed above. This instruction/function/command conveys to the hardware (either through the hardware or through software setup of the hardware) that the lines pushed are to be broadcast to all consumers (each consumer receives a copy of the contents before the line is popped).

open_fifo_broadcast_consumer <r1> <r2>
    This instruction takes in two registers and is used to open a consumer broadcast FIFO handle to receive an actual virtual address which can be used in conjunction with a standard load instruction. The first operand r1 is supplied as the consumer FIFO handle. The second operand r2 is filled by the instruction as the address to use for loads from the FIFO. This is the cache line that will get the format discussed above. The difference from the previous instruction/function/command is that this version conveys to the hardware (either through the hardware or through software setup of the hardware) that the lines consumed are broadcast to all consumers (each consumer receives a copy of the contents before the line is popped).

Optional instructions
    These instructions may be used in embodiments to reduce the complexity of the link controller and (if memory stores are not used) the Translation Look-aside Buffer (TLB) logic.

push_fifo <r1> <r2>
    This instruction takes the valid address returned by the open_fifo_producer instruction as r1 and a value in r2; the value is written to the FIFO associated with r1.

pop_fifo <r1> <r2>
    This instruction takes the valid address returned by the open_fifo_consumer instruction as r1 and, on execution, places the value from the head of the FIFO into r2. This instruction could block if no values are in the FIFO or throw an error/interact with the scheduler (implementation dependent).

A virtual link buffer with LIFO order, or another data link buffer, may be implemented in an analogous manner.

A link buffer may provide data transfer between a producer and a single consumer or between a producer and multiple consumers. In one embodiment, data is broadcast to multiple consumers so that multiple threads can share the same data values. Cache lines are pushed to the link controller to be broadcast to all consumers. Each consumer receives a copy of the contents before a line is popped. For each handle that is shared between a producer-consumer pair, a ‘pop’ counter is provided in the link controller. Data values are not considered to be completely ‘popped’ until the counter value is equal to the number of consumers. Alternatively, a bit-field may be used to identify which consumers have received the cache line.
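
One way to model the per-handle ‘pop’ counter described above is sketched below. The structure and function names are assumptions made for illustration; the disclosure only requires that a line is not retired until every registered consumer has received its copy.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping kept by the link controller for one broadcast line. */
struct broadcast_entry {
    uint32_t num_consumers;   /* consumers registered for this handle            */
    uint32_t pop_count;       /* consumers that have received the line so far    */
};

/* Record that one consumer has taken its copy; the line is considered fully
   ‘popped’ only when the counter equals the number of consumers. */
static bool broadcast_pop(struct broadcast_entry *e) {
    e->pop_count++;
    return e->pop_count >= e->num_consumers;   /* true: line may be retired      */
}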

In a potential embodiment, a link controller may also signal an execution thread scheduler upon arrival of data to a given buffer set.

On creation of a virtual link buffer (using the make_fifo instruction, for example), a virtual link buffer is registered with the producer handle stored in register <r1> and a consumer handle in register <r2>. The same handle may be used for both producer and consumer, since the directionality information can be provided through the open instruction. Alternatively, different handles may be used. An additional layer of safety is provided, and decoding/checking in hardware is made easier, if the handles are specified at the make_fifo instruction level.

FIG. 8 is a flow chart 800 showing operation of a make_fifo instruction issued at block 802. The instruction has operands <r1> and <r2>. At block 804, the instruction is trapped to the link controller in the interconnect structure. At block 806, the link controller creates table entries for producer and consumer handles for a virtual link buffer. An embodiment of the table (224 in FIG. 2) is discussed in more detail below with reference to FIG. 12. At block 808, the producer handle is returned in the first operand, <r1>, and the consumer handle is returned in the second operand, <r2>. The operation ends at block 810. In a further embodiment, assignment of the producer and consumer handles is performed by an operating system.

The make_fifo instruction initializes the virtual FIFO within the system and creates two handles in the virtual memory space. The handles do not correspond to real pointers and are not backed by memory of any kind. These pseudo-addresses are used as handles within the same virtual memory space to access the links created, either by the producer or the consumer. In a further embodiment, the virtual addresses responded to by the hardware could be assigned by the software.

FIG. 9 is a flow chart 900 showing operation of an open_fifo_producer instruction issued at block 902. The instruction has operands <r1> and <r2>. In this example, the instruction is issued by a producer device. However, an open_fifo_consumer instruction issued by a consumer device is treated in an analogous manner. In response to the instruction, a virtual link buffer is opened. At decision block 904, the pseudo-address or handle in operand <r1> is checked to see if it is in the virtual address range reserved for virtual FIFO addresses. If it is not in a valid FIFO range, as depicted by the negative branch from decision block 904, a fault is signaled at block 906 and the process terminates at block 908. If the address or handle is in a valid FIFO range, the instruction is trapped to the link controller at block 910. At block 912, a virtual address in the reserved address range is issued for the virtual link buffer, and the FIFO is registered with that address in the link controller. At block 914, the issued address is returned to the processing device in register <r2>. The returned address will identify the cache line to be formatted for use as a virtual FIFO. The process is the same for both producer and consumer devices, with the exception that the handles and virtual address are registered in the appropriate direction.

The virtual addresses returned by the link controller correspond to the start of a cache line (after translation). Each registered producer and consumer device gets a single cache line address that it will use while the FIFO exists for that producer or consumer.

In one embodiment, there may be multiple producers or consumers. The interconnect structure is used to fetch and retrieve FIFO cache lines. The FIFO lines are formatted as described above with respect to FIGS. 5-7. As detailed above, the amount of valid data stored in the cache line is saved within the cache line itself. It is noted that no modifications need be made to the normal cache coherence protocol, although further optimizations will be apparent to those of ordinary skill in the art and may result in a modified protocol.

FIG. 10 is a flow chart 1000 showing operation of a store instruction issued at block 1002. This example relates to a virtual link buffer with FIFO ordering, but other virtual link buffers may be operated in an analogous manner. The store instruction has operands <r1> and <r2>. The instruction is issued by a producer device. At decision block 1004, the virtual memory address in operand <r1> is checked to see if it is in the address range assigned to virtual FIFO handles. If it is not in a valid FIFO range, as depicted by the negative branch from decision block 1004, a standard (non-FIFO) store to the cache line is performed at block 1006 and the process terminates at block 1008. If the address or handle is in a valid FIFO range, as depicted by the positive branch from decision block 1004, the cache line is accessed as a virtual FIFO at block 1010. At block 1012 the cache controller reads the tail indicator stored in the designated region of the cache line. If the tail indicator indicates that the cache line is not yet full, as depicted by the negative branch from decision block 1014, the data in operand <r2> is stored in the cache line at the location indicated by the tail indicator at block 1016. The tail indicator is updated at block 1018 and processing of the instruction is complete as indicated by termination block 1020. The tail indicator is updated by incrementing in the example cache line format described above with reference to FIG. 5. However, other formats may be used. If the tail indicator indicates that the cache line is full, as depicted by the positive branch from decision block 1014, the link controller is signaled at block 1022. In response, the link controller snoops the cache line in a shared state at block 1024. If a snooped consumer device indicates that it is ready to receive FIFO data, as depicted by the positive branch from decision block 1026, the full cache line is pushed from the producer device to the consumer device at block 1028. In the producer device, the tail indicator in the cache line is reset and the state of the cache line is set to exclusive (‘E’) at block 1030. Processing of the instruction is complete, as indicated by termination block 1032. If the snooped consumer device indicates that it is not ready to receive FIFO data, as depicted by the negative branch from decision block 1026, the full cache line is buffered at block 1034. The cache line may be buffered in virtual memory or internal SRAM, for example. In a further embodiment, the instruction is blocked until the full cache line has been transferred to the consumer device. Processing of the instruction is then complete, as indicated by termination block 1032.

FIG. 11 is a flow chart 1100 showing operation of a load instruction issued at block 1102. The instruction has operands <r1> and <r2>. The instruction is issued by a consumer device to load data from a virtual link buffer. At decision block 1104, the address in operand <r1> is checked to see if it is in the range of virtual addresses reserved for virtual FIFOs. In one embodiment, the address may be checked using a translation lookaside buffer (TLB). If it is not in a valid FIFO range, as depicted by the negative branch from decision block 1104, the cache line is accessed as a standard (non-FIFO) load at block 1106 and the process terminates at block 1108. If the address or handle is in a valid FIFO range, as depicted by the positive branch from decision block 1104, the cache line is accessed as a virtual FIFO at block 1110. At block 1112 the cache controller reads the head indicator stored in the designated region of the cache line. If the head indicator indicates that the cache line is not empty, as depicted by the negative branch from decision block 1114, the data is loaded into operand <r2> from the cache line at the location indicated by the head indicator at block 1116. The head indicator is updated at block 1118 and processing of the instruction is complete, as indicated by termination block 1120. The head indicator may be incremented or decremented depending upon the format selected for the cache line. If the head indicator indicates that the cache line is empty, as depicted by the positive branch from decision block 1114, the link controller is signaled at block 1122. In response to the signal, the link controller determines, at decision block 1124, if buffered data (stored when a producer line becomes full, for example) is available. If buffered data is available, as depicted by the positive branch from decision block 1124, the buffered cache line is sent to the consumer device at block 1126 and flow continues to block 1116. If no buffered data is available, as depicted by the negative branch from decision block 1124, the link controller snoops producer devices at block 1128 to discover filled or partially filled lines for the virtual link buffer. The snooped address is the producer address that is associated with the consumer handle. If a line exists, as depicted by the positive branch from decision block 1130, the producer device sends the line to the link controller and the link controller changes the line address from the producer address to the corresponding consumer address at block 1132. In some embodiments, the data in the received cache line may be rearranged, as discussed in the examples above. At block 1134, the cache line in the producer device is invalidated or reset to prepare for new data. Flow continues to block 1116 where the consumer device loads the data at the position in the cache line indicated by the head indicator. If no corresponding producer cache line exists, as depicted by the negative branch from decision block 1130, the request, together with the consumer FIFO handle, is registered at the link controller at block 1136 and the process terminates at block 1138. The registered request may be serviced at a later time, either in response to later snoop requests to producer devices or in response to receiving a full cache line from a producer device, for example.

FIG. 12 shows a link controller table 224, in accordance with certain embodiments. When a virtual link with FIFO ordering is created, using the make_fifo <r1> <r2> instruction for example, an entry is made in the table 224. A shared identifier of the virtual link buffer is stored in column 1202. The identifier may be the virtual memory address of the FIFO, for example. A producer handle is assigned and stored in column 1204 and a consumer handle is assigned and stored in column 1206. The handles correspond to pseudo-addresses, which in turn map the virtual link into the corresponding caches of the producer and consumer devices. If the virtual link is backed in virtual memory, an initial chunk of virtual memory may be allocated. Various mechanisms may be used for the allocation, for example, pre-allocation, dynamic trapping to software allocators, or the corresponding mechanisms for varying combinations of hardware/software managed memory.

Table 224 may also include bit-field 1208 that indicates if a producer cache line is ready and bit-field 1210 that indicates if a consumer device is ready to receive cache line data. In an alternative embodiment, bit-field 1210 may contain a single bit, or one bit for each consumer when the produced data is to be sent to multiple consumers.
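
A software model of one row of table 224 might look like the sketch below. The field names and widths are illustrative assumptions mirroring columns 1202 through 1210 of FIG. 12, not a definition of the hardware layout.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of one entry of link controller table 224. */
struct link_table_entry {
    uint64_t fifo_id;           /* shared identifier of the virtual link (1202)   */
    uint64_t producer_handle;   /* producer pseudo-address (1204)                 */
    uint64_t consumer_handle;   /* consumer pseudo-address (1206)                 */
    bool     data_ready;        /* producer cache line ready (bit-field 1208)     */
    uint32_t consumer_ready;    /* one bit per consumer: ready to receive (1210)  */
};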

In some embodiments, a producer cache line is transferred to a consumer cache when (a) the line is filled and the consumer is ready to receive it, or (b) a line is at least partially filled and is requested by a consumer device.

In a further embodiment, filled producer cache lines are buffered in memory until needed by a consumer device. The buffering of cache lines increases the capacity of the link. It also serves to absorb burst-like behavior when the inter-arrival rates of the producer and service processes are not deterministic. When ordering of the data in the link buffer is required, the link controller stores a buffer table that provides information for maintaining the desired order of lines. Thus, data order within a cache line is maintained by an index stored in the line itself, while the data order of lines buffered in memory is maintained through use of a buffer table.

FIG. 13 shows a buffer table 1300 for tracking virtual link buffers that are stored in memory, in accordance with certain embodiments. Memory may be used to buffer filled producer cache lines. For each of the virtual link buffers listed in column 1302, buffer table 1300 stores a pointer to a corresponding buffer in memory in column 1304 and a tail offset in column 1306. These indicators identify the buffered lines that have been received from a producer device but not yet transferred to a consumer device. The buffered lines may be stored in a reserved region of memory. In a further embodiment, the buffered lines may be identified via a linked list or other order structure indicative of line order. When data is to be sent to multiple consumers or broadcast to all consumers, buffer table 1300 may include an entry that indicates when all designated consumers have received the data. The entry may be, for example, a counter or a bit-field that is updated when data is sent to a consumer.
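
The bookkeeping of FIG. 13 can be modeled as in the sketch below. The field names are illustrative assumptions; the link identifier, buffer pointer and tail offset correspond to columns 1302, 1304 and 1306, while the delivery counter reflects the optional multi-consumer entry described above.

#include <stdint.h>

/* Illustrative model of one row of buffer table 1300. */
struct buffer_table_entry {
    uint64_t fifo_id;           /* virtual link buffer identifier (column 1302)   */
    uint8_t *line_buffer;       /* pointer to the buffer region in memory (1304)  */
    uint32_t tail_offset;       /* offset of the next free line slot (1306)       */
    uint32_t consumers_served;  /* counter or bit-field for broadcast delivery    */
};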

FIG. 14 is a flow chart 1400 of a method of operation of a producer device for link buffer communication, in accordance with certain embodiments. In this embodiment, the link controller is used in conjunction with a standard virtual memory pathway. The flow chart 1400 describes an example embodiment where the link buffer is a FIFO buffer, but analogous methods may be used for other communication patterns. At block 1402 an instruction is issued from a producer device to push or store a data value <DATA> to a virtual address <VA>. The virtual address is then mapped to a corresponding physical address. In this embodiment, the virtual address <VA> is used at block 1404 to access a translation look-aside buffer (TLB). A corresponding memory page is retrieved at block 1406. At block 1408, the data line at the physical address is retrieved and loaded in the cache of the producer device. The data in the line may be formatted as described above. At block 1410, the tail index is read from the assigned position in the formatted cache line. The tail index may indicate the number of valid entries in the cache line and/or the position in the line for the next store operation. If the tail index indicates that the cache line is not full, as depicted by the negative branch from decision block 1412, the data value <DATA> is stored, at block 1414, at the indicated position in the cache line and the tail index is updated at block 1416. The response to the push instruction terminates at block 1418. If, however, the tail index indicates that the cache line is full, as depicted by the positive branch from decision block 1412, the cache line is pushed to the link controller at block 1420. At block 1422, an ‘acknowledge’ signal is received from the link controller to indicate receipt of the cache line and the cache line is reset at block 1424. At block 1426, the data value <DATA> is written to the first entry in the cache line and the tail index is updated accordingly. The response to the push instruction terminates at block 1418. In this way, the producer device is able to write data to the virtual link buffer.

FIG. 15 is a flow chart 1500 of a method of operation of a link controller for link buffer communication, in accordance with certain embodiments. Following start block 1502, the link controller receives a full cache line from a producer device at block 1504. At block 1506, the line address is looked up in a producer content addressable memory (CAM) or similar. If a corresponding memory address is not found, as determined by the negative branch from decision block 1508, a fault is indicated at block 1510 and the method terminates at block 1512. If a corresponding memory address is found, as determined by the positive branch from decision block 1508, the received cache line data is buffered in the memory at block 1514 and an ‘acknowledge’ signal is sent to the producer device at block 1516. At block 1518, the consumer list associated with the producer is identified in the link controller table. The identified consumer list is scanned to determine if a consumer has requested the received data. If a consumer device has requested the data, as depicted by the positive branch from decision block 1520, the buffered line is written to the interconnect bus and directed towards the address of the requesting consumer device at block 1522. The method terminates at block 1512. If no consumer has requested the data, as depicted by the negative branch from decision block 1520, the buffered line is pushed to memory (such as DRAM) at block 1524 and addressed with the consumer tag for the virtual link buffer. The memory may be allocated by the operating system kernel, for example. The data may also be buffered if there are multiple consumers of the data and not all are ready to receive the data. In this case, an entry in the link controller table identifies the number of consumers that have received the data or which consumers have received the data. If the memory buffer is full, as depicted by the positive branch from decision block 1526, flow continues to block 1528 where receipt of further cache lines may be blocked, a fault may be thrown or some other action may be taken. The process terminates at block 1530. If the memory buffer is not full, as depicted by the negative branch from decision block 1526, a ‘data available’ bit (e.g. 1208 in FIG. 12) is set at block 1532 to signal to the one or more consumers that data is available in the memory buffer, and the process terminates at block 1530.

FIG. 16 is a further flow chart 1600 of a method of operation of a link controller for link buffer communication, in accordance with certain embodiments. At block 1602, an instruction is issued from a consumer device to pop or load a data value <DATA> from a virtual address <VA>. The virtual address is then mapped to a corresponding physical address. In this embodiment, the virtual address <VA> is used at block 1604 to access a translation look-aside buffer (TLB). A corresponding memory page is retrieved at block 1606 and the physical address is retrieved at block 1608. If the line at the physical address is not found in the list of consumers in the table maintained by the link controller, as depicted by the negative branch from decision block 1610, a fault is deemed to have occurred. A selected action is performed and the method terminates at block 1612. If the physical address is found in the list of consumers in the table maintained by the link controller, as depicted by the positive branch from decision block 1610, a ‘data ready’ bit in the link controller table (1208 in FIG. 12, for example) is checked at block 1614 to determine if data is available from a producer device. If no data is available, as depicted by the negative branch from decision block 1614, a ‘need data’ bit (1210 in FIG. 12, for example) is set at block 1616 to indicate a request for data and the method terminates at block 1618. If data is available, as depicted by the positive branch from decision block 1614, the next data line is retrieved from the buffer memory at block 1620. If the data is being shared between a number of consumer devices, a coherence state is associated with the line. The coherence state is checked at block 1622. If the coherence state indicates that the line is exclusively owned (‘E’), as depicted by the positive branch from decision block 1624, the buffer table is updated at block 1626 to indicate the new buffer head, the ‘need data’ bit is cleared, and the data line is transferred with the address of the requesting consumer device at block 1628. The response to the request is complete and the method terminates at block 1630. If the coherence state indicates that the line is not exclusively owned (‘E’), as depicted by the negative branch from decision block 1624, the link controller waits until the state becomes exclusive.
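A corresponding C sketch of the consumer-request path of FIG. 16 is given below. The table fields and helper functions (consumer_lookup(), buffer_peek_head(), buffer_advance_head(), line_exclusively_owned()) are hypothetical names for illustration, and the busy-wait loop stands in for whatever stall or retry mechanism the hardware actually uses while the line leaves the shared state.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-buffer entry in the link controller table (names are illustrative). */
typedef struct {
    uint64_t consumer_addr;
    bool     data_available;  /* 'data ready' bit, e.g. 1208 in FIG. 12 */
    bool     need_data;       /* 'need data' bit, e.g. 1210 in FIG. 12  */
} link_entry_t;

extern link_entry_t *consumer_lookup(uint64_t phys_addr);              /* block 1610 */
extern const void *buffer_peek_head(link_entry_t *e);                  /* block 1620 */
extern void buffer_advance_head(link_entry_t *e);                      /* block 1626 */
extern bool line_exclusively_owned(const void *line);                  /* block 1622 */
extern void send_line_to(uint64_t consumer_addr, const void *line);    /* block 1628 */
extern void raise_fault(void);                                         /* block 1612 */

/* Handle a pop/load request arriving from a consumer device (FIG. 16). */
void link_on_consumer_request(uint64_t phys_addr)
{
    link_entry_t *e = consumer_lookup(phys_addr);
    if (e == NULL) {
        raise_fault();                        /* block 1612: not a registered consumer        */
        return;
    }

    if (!e->data_available) {                 /* block 1614: nothing buffered yet             */
        e->need_data = true;                  /* block 1616: remember the outstanding request */
        return;
    }

    const void *line = buffer_peek_head(e);   /* block 1620: oldest buffered line             */
    while (!line_exclusively_owned(line)) {   /* blocks 1622-1624: wait for the 'E' state     */
        /* spin; real hardware would stall or retry the request */
    }

    buffer_advance_head(e);                   /* block 1626: record the new buffer head       */
    e->need_data = false;
    send_line_to(e->consumer_addr, line);     /* block 1628: return the data line             */
}
```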

FIG. 17 is a diagrammatic representation of a data processing system 1700 that implements a virtual FIFO buffer 1702. Virtual FIFO buffer 1702 has a number of data entries indicated by the grayed locations. The first entry is the head entry denoted as ‘H’ and the last or most recent entry is the tail entry denoted ‘T’. These positions vary as data is written to or read from the FIFO buffer, so FIG. 17 represents the configuration at a particular moment in time. In this embodiment, the data processing system 1700 includes a producer cache 1704, a consumer cache 1706 and a memory 1708. A cache line contains 64 bytes in this example, of which 62 bytes contain data and 2 bytes contain metadata. The virtual buffer 1702 is divided into several sections of length 62 bytes. At the time shown, the oldest section, which includes entries H-A, is stored as a line in the consumer cache 1706 for access by the consumer device, section B-C and section D-E are stored in memory 1708, and the most recent section, with entries F-T, is stored in a line of the producer cache 1704 and is updated by the producer device. Each cache line stores two bytes of metadata that include a tail or head index ‘I’ that is updated by the cache controller of the associated device. When a line in the producer cache is filled, it is transferred to the link controller as described above with reference to FIG. 14 and as indicated by arrow 1710 in FIG. 17. When a consumer requests more FIFO data, the request is serviced by the link controller as described above with reference to FIG. 16 and as indicated by arrow 1712 in FIG. 17.
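To make the per-line metadata of FIG. 17 concrete, the sketch below assumes the 62-byte-data/2-byte-metadata layout and shows how a consumer device might consume elements from its cached section using the index ‘I’. The field names, the count byte and the link_request_line() helper are assumptions introduced for illustration; the actual consumer flow is the one described elsewhere in this disclosure.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define DATA_BYTES 62                  /* 62 data bytes + 2 metadata bytes per 64-byte line */

/* Hypothetical layout of a formatted consumer cache line. */
typedef struct {
    uint8_t data[DATA_BYTES];
    uint8_t head;                      /* load position index 'I' (metadata)                */
    uint8_t count;                     /* number of valid bytes in the line (metadata)      */
} consumer_line_t;

/* Assumed hook: ask the link controller for the next buffered section (see FIG. 16). */
extern bool link_request_line(consumer_line_t *line);

/* Pop one element of 'size' bytes; returns false when no data is available yet. */
bool fifo_pop(consumer_line_t *line, void *value, uint8_t size)
{
    if (line->head >= line->count) {               /* cached section exhausted               */
        if (!link_request_line(line))              /* fetch the next section, if any         */
            return false;                          /* 'need data' recorded; try again later  */
        line->head = 0;                            /* start of the newly fetched section     */
    }
    memcpy(value, &line->data[line->head], size);  /* load at the head position              */
    line->head += size;                            /* update the load position indicator     */
    return true;
}
```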

Cache line data received from the producer cache may be buffered in memory by the link controller, as described with reference to FIG. 15. In FIG. 17, section B-C and section D-E are buffered in memory 1708. The order of the buffered lines may be maintained by the link controller. In the example embodiment shown, buffered lines are stored consecutively. Buffer head pointer 1714 is updated when lines are transferred to the consumer and buffer tail pointer 1716 is updated as lines are received from the producer. In a further embodiment, the order of the buffered lines is maintained by a linked list or other equivalent data structure capable of maintaining the order of link buffer data in memory.
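As one possible realization of the ordering scheme just described, the C sketch below keeps whole 64-byte line sections in a ring ordered by a head pointer and a tail pointer, in the manner of pointers 1714 and 1716 in FIG. 17. The capacity, field names and helper functions are illustrative assumptions rather than the disclosed design.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define LINE_BYTES 64
#define BUF_LINES  8                   /* capacity of the in-memory section buffer (assumed) */

/* Illustrative ring of buffered cache-line sections, ordered by head/tail pointers. */
typedef struct {
    uint8_t  lines[BUF_LINES][LINE_BYTES];
    uint32_t head;                     /* next line to hand to the consumer (pointer 1714)   */
    uint32_t tail;                     /* next free slot for a producer line (pointer 1716)  */
} line_buffer_t;

/* Producer side: store a received line; returns false when the buffer is full (block 1526). */
static bool buf_put(line_buffer_t *b, const uint8_t *line)
{
    if (b->tail - b->head == BUF_LINES)
        return false;                                         /* buffer full                 */
    memcpy(b->lines[b->tail % BUF_LINES], line, LINE_BYTES);
    b->tail++;                                                /* advance the tail pointer    */
    return true;
}

/* Consumer side: retrieve the oldest buffered line; returns false when nothing is buffered. */
static bool buf_get(line_buffer_t *b, uint8_t *line)
{
    if (b->head == b->tail)
        return false;                                         /* nothing buffered            */
    memcpy(line, b->lines[b->head % BUF_LINES], LINE_BYTES);
    b->head++;                                                /* advance the head pointer    */
    return true;
}
```

A linked list, as mentioned above, would serve equally well when the buffered sections cannot be kept consecutive in memory.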

The integrated circuits disclosed above may be defined as a set of instructions of a Hardware Description Language (HDL). The instructions may be stored in a non-transient computer readable medium. The instructions may be distributed via the computer readable medium or via other means such as a wired or wireless network. The instructions may be used to control manufacture or design of the integrated circuit, and may be combined with other instructions.

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

It will be appreciated that the devices, systems, and methods described above are set forth by way of example and not of limitation. Absent an explicit indication to the contrary, the disclosed steps may be modified, supplemented, omitted, and/or re-ordered without departing from the scope of this disclosure. Numerous variations, additions, omissions, and other modifications will be apparent to one of ordinary skill in the art. In addition, the order or presentation of method steps in the description and drawings above is not intended to require this order of performing the recited steps unless a particular order is expressly required or otherwise clear from the context.

The method steps of the implementations described herein are intended to include any suitable method of causing such method steps to be performed, consistent with the patentability of the following claims, unless a different meaning is expressly provided or otherwise clear from the context. So, for example, performing X includes any suitable method for causing another party such as a remote user, a remote processing resource (e.g., a server or cloud computer) or a machine to perform X. Similarly, performing elements X, Y, and Z may include any method of directing or controlling any combination of such other individuals or resources to perform elements X, Y, and Z to obtain the benefit of such steps. Thus, method steps of the implementations described herein are intended to include any suitable method of causing one or more other parties or entities to perform the steps, consistent with the patentability of the following claims, unless a different meaning is expressly provided or otherwise clear from the context. Such parties or entities need not be under the direction or control of any other party or entity, and need not be located within a particular jurisdiction.

It should further be appreciated that the methods above are provided by way of example. Absent an explicit indication to the contrary, the disclosed steps may be modified, supplemented, omitted, and/or re-ordered without departing from the scope of this disclosure.

It will be appreciated that the methods and systems described above are set forth by way of example and not of limitation. Numerous variations, additions, omissions, and other modifications will be apparent to one of ordinary skill in the art. In addition, the order or presentation of method steps in the description and drawings above is not intended to require this order of performing the recited steps unless a particular order is expressly required or otherwise clear from the context. Thus, while particular embodiments have been shown and described, it will be apparent to those skilled in the art that various changes and modifications in form and details may be made therein without departing from the scope of this disclosure and are intended to form a part of the disclosure as defined by the following claims, which are to be interpreted in the broadest sense allowable by law.

The various representative embodiments, which have been described in detail herein, have been presented by way of example and not by way of limitation. It will be understood by those skilled in the art that various changes may be made in the form and details of the described embodiments resulting in equivalent embodiments that remain within the scope of the appended claims.

Accordingly, some features of the disclosed embodiments are set out in the following numbered items:

1. A data processing system for providing a hardware-accelerated virtual link buffer, the data processing system comprising:

-   a first cache accessible by a first processing device;
-   a second cache accessible by a second processing device; and
-   an interconnect structure that couples the first cache and the second cache, the interconnect structure comprising a link controller;

where the first cache is configured to store a plurality of data elements produced by the first processing device in a producer cache line; where the link controller comprises hardware configured to transfer data elements in the producer cache line to a consumer cache line in the second cache; and where the second cache is configured to access a consumer cache line to provide the plurality of data elements, produced by the first processing device, to the second processing device.

2. The data processing system of item 1, further comprising a first cache controller, where the plurality of data elements are produced in sequence, where a data element of the plurality of data elements is stored at a location in the producer cache line indicated by a store position indicator, where the store position indicator is stored at a predetermined location in the producer cache line and where the first cache controller is configured to access the store position indicator.

3. The data processing system of item 1, further comprising a second cache controller, where a data element of the plurality of data elements is loaded from a location in the consumer cache line indicated by a load position indicator, where the load position indicator is stored at a predetermined location in the consumer cache line and where the second cache controller is configured to access the load position indicator.

4. The data processing system of item 1, further comprising a memory, where the producer cache line is associated with a producer handle and where the consumer cache line is associated with a consumer handle, where the producer handle and the consumer handle are stored in a table in the memory and where the table is accessible by the link controller.

5. The data processing system of item 1, further comprising a memory, where the link controller is configured to buffer data elements transferred from the producer cache line to the consumer cache line in the memory.

6. The data processing system of item 5, where the link controller is configured to maintain an order of buffered data elements.

7. The data processing system of item 1, where the link controller is configured to transfer data elements in the producer cache line to consumer cache lines in second caches of a plurality of second processing devices.

8. The data processing system of item 1, further comprising the first processing device and the second processing device.
9. A method for providing a virtual link buffer between a producer processing device and a consumer processing device in a data processing system, the method comprising:

-   allocating a first virtual address for identifying a cache line in a cache of the producer processing device;
-   allocating a second virtual address for identifying a cache line in a cache of the consumer processing device;
-   storing, by the producer processing device, one or more data elements in a first cache line of the producer processing device, the first cache line identified by the first virtual address;
-   transferring, by a link controller in an interconnect structure that couples the producer and consumer processing devices, the one or more data elements in the first cache line to a second cache line in the cache of the consumer processing device, the second cache line identified by the second virtual address; and
-   loading, by the consumer processing device, the one or more data elements from the second cache line.

10. The method of item 9, where storing, by the producer processing device, a data element of the one or more data elements in the first cache line of the producer processing device comprises:

-   reading a store position indicator from a designated location in the first cache line;
-   storing the data element at a location in the first cache line indicated by the store position indicator; and
-   updating the store position indicator in the first cache line.

11. The method of item 10, where updating the store position indicator comprises:

-   reading a data element size from a designated location in the first cache line; and
-   modifying the store position indicator dependent upon the data element size.

12. The method of item 9, where the first cache line includes a plurality of error correction code (ECC) bits associated with data locations in the first cache line and where storing, by the producer processing device, a data element of the one or more data elements in the first cache line of the producer processing device comprises:

-   storing the data element at a first location in the first cache line indicated by the store position indicator; and
-   updating an ECC bit, of the plurality of ECC bits, associated with the first location to indicate that data at the first location is valid.

13. The method of item 9, where loading, by the consumer processing device, the one or more data elements from the second cache line comprises:

-   reading a load position indicator from a designated location in the second cache line;
-   loading the data element at a location in the second cache line indicated by the load position indicator; and
-   updating the load position indicator in the second cache line.

14. The method of item 9, where storing, by the producer processing device, one or more data elements in the first cache line of the producer processing device comprises:

-   translating the first virtual address to a first intermediate address in a storage device of the data processing system; and
-   identifying the first cache line from the first intermediate address.

15. The method of item 9, further comprising:

-   allocating, by the link controller, a producer handle comprising a pseudo-address for enabling the producer processing device to reference the virtual link buffer;
-   allocating, by the link controller, a consumer handle comprising a pseudo-address for enabling the consumer processing device to reference the virtual link buffer; and
-   associating, by the link controller, the producer handle with the consumer handle.

16. The method of item 9, where transferring the one or more data elements in the first cache line of the producer processing device to the second cache line in the cache of the consumer processing device comprises:

-   transferring the first cache line to the link controller;
-   storing, by the link controller, the first cache line in a line buffer in a memory of the data processing system; and
-   transferring the first cache line from the line buffer to the second cache line of the consumer processing device.

17. The method of item 16, where the line buffer comprises a first-in, first-out line buffer.

18. The method of item 16, where the line buffer comprises a first-in, last-out line buffer.

19. The method of item 16, where the line buffer comprises an unordered buffer.

20. The method of item 16, further comprising the link controller maintaining an order of lines stored in the line buffer by accessing a memory, where the memory contains one or more of a table, a head pointer, a tail pointer or a linked list.

21. The method of item 16, further comprising maintaining a coherence state of the first cache line transferred from the producer to the link controller and stored in the line buffer.

22. The method of item 9, where transferring the one or more data elements in the first cache line to the second cache line comprises the link controller:

-   receiving a request from the consumer processing device for data associated with the second virtual address allocated to the virtual link buffer;
-   determining if one or more cache lines associated with the virtual link buffer are stored in the line buffer; and
-   when one or more cache lines associated with the virtual link buffer are stored in the line buffer:
    -   selecting a cache line of the one or more stored cache lines; and
    -   transferring one or more data elements in the selected cache line from the line buffer to the consumer processing device.
23. The method of item 9, where transferring the one or more data elements in the first cache line to the second cache line comprises the link controller:

-   receiving a request from the consumer processing device for data associated with the second virtual address allocated to the virtual link buffer;
-   identifying the first virtual address allocated to the virtual link buffer;
-   requesting a cache line associated with the identified first virtual address from the producer processing device; and
-   transferring a cache line received from the producer processing device to the consumer processing device.

24. The method of item 23, where requesting the cache line associated with the identified first virtual address from the producer processing device comprises requesting a cache line associated with a physical address that maps to the identified first virtual address.

25. The method of item 9, further comprising:

-   reading a store position indicator from a designated location in the first cache line; and
-   determining from the store position indicator if the first cache line is full;

where transferring the one or more data elements in the first cache line of the producer processing device to a second cache line in the cache of the consumer processing device comprises:

-   transferring the first cache line to the link controller when the first cache line is full.

26. The method of item 9, further comprising transferring the first cache line from the buffer in memory to the second cache line of the consumer processing device in response to a signal from the consumer processing device.

27. The method of item 9, where transferring, by the link controller, the one or more data elements in the first cache line to the second cache line comprises:

-   rearranging, by the link controller, the position or order of the one or more data elements in the cache line.

28. A method for providing a virtual link buffer between a producer processing device and a consumer processing device in a data processing system, the method comprising:

-   allocating a first virtual address for identifying a cache line in a cache of the producer processing device;
-   allocating a second virtual address for identifying a cache line in a cache of the consumer processing device, where the second virtual address is associated with the first virtual address;
-   receiving a first cache line associated with the first virtual address from the producer processing device, the first cache line comprising one or more data elements stored by the producer processing device, the first cache line identified by the first virtual address;
-   determining the second virtual address associated with the first virtual address; and
-   transferring the first cache line with the second virtual address to the consumer processing device.

29. The method of item 28, where determining the second virtual address associated with the first virtual address comprises accessing a table of virtual address pairs.

30. The method of item 28, where determining the second virtual address associated with the first virtual address comprises accessing a second virtual address encoded in the first cache line.
31. The method of item 28, where transferring the first cache line with the second virtual address to the consumer processing device comprises buffering the first cache line in a memory.

32. A method for providing a virtual link buffer between a producer processing device and a plurality of consumer processing devices in a data processing system, the method comprising:

-   allocating a first virtual address for identifying a cache line in a cache of the producer processing device;
-   storing, by the producer processing device, one or more data elements in a first cache line of the producer processing device, the first cache line identified by the first virtual address;
-   transferring the one or more data elements in the first cache line of the producer processing device to a link controller of the data processing system;
-   storing, by the link controller, the received one or more data elements;
-   transferring, by the link controller, the one or more data elements to one or more consumer processing devices of the plurality of consumer processing devices; and
-   determining if the one or more data elements have been transferred to all of the plurality of consumer processing devices.

33. The method of item 32, where determining if the one or more data elements have been transferred to all of the plurality of consumer processing devices comprises the link controller maintaining a count of devices to which the one or more data elements have been transferred.

34. The method of item 32, where determining if the one or more data elements have been transferred to all of the plurality of consumer processing devices comprises the link controller maintaining a bit-map of devices to which the one or more data elements have been transferred.

35. The method of item 32, where transferring, by the link controller, the one or more data elements to a consumer processing device of the plurality of consumer processing devices is performed in response to a request from that consumer processing device.

CLAIMS

1. A data processing system for providing a hardware-accelerated virtual link buffer, the data processing system comprising: a first cache accessible by a first processing device; a second cache accessible by a second processing device; and an interconnect structure that couples the first cache and the second cache, the interconnect structure comprising a link controller; where the first cache is configured to store a plurality of data elements produced by the first processing device in a producer cache line; where the link controller comprises hardware configured to transfer data elements in the producer cache line to a consumer cache line in the second cache; and where the second cache is configured to access a consumer cache line to provide the plurality of data elements, produced by the first processing device, to the second processing device.

2. The data processing system of claim 1, further comprising a first cache controller, where the plurality of data elements are produced in sequence, where a data element of the plurality of data elements is stored at a location in the producer cache line indicated by a store position indicator, where the store position indicator is stored at a predetermined location in the producer cache line and where the first cache controller is configured to access the store position indicator.

3. The data processing system of claim 1, further comprising a second cache controller, where a data element of the plurality of data elements is loaded from a location in the consumer cache line indicated by a load position indicator, where the load position indicator is stored at a predetermined location in the consumer cache line and where the second cache controller is configured to access the load position indicator.

4. The data processing system of claim 1, further comprising a memory, where the producer cache line is associated with a producer handle and where the consumer cache line is associated with a consumer handle, where the producer handle and the consumer handle are stored in a table in the memory and where the table is accessible by the link controller.

5. The data processing system of claim 1, further comprising a memory, where the link controller is configured to buffer data elements transferred from the producer cache line to the consumer cache line in the memory.

6. The data processing system of claim 5, where the link controller is configured to maintain an order of buffered data elements.

7. The data processing system of claim 1, where the link controller is configured to transfer data elements in the producer cache line to consumer cache lines in second caches of a plurality of second processing devices.

8. The data processing system of claim 1, further comprising the first processing device and the second processing device.

9. A method for providing a virtual link buffer between a producer processing device and a consumer processing device in a data processing system, the method comprising: allocating a first virtual address for identifying a cache line in a cache of the producer processing device; allocating a second virtual address for identifying a cache line in a cache of the consumer processing device; storing, by the producer processing device, one or more data elements in a first cache line of the producer processing device, the first cache line identified by the first virtual address; transferring, by a link controller in an interconnect structure that couples the producer and consumer processing devices, the one or more data elements in the first cache line to a second cache line in the cache of the consumer processing device, the second cache line identified by the second virtual address; and loading, by the consumer processing device, the one or more data elements from the second cache line.
10. The method of claim 9, where storing, by the producer processing device, a data element of the one or more data elements in the first cache line of the producer processing device comprises: reading a store position indicator from a designated location in the first cache line; storing the data element at a location in the first cache line indicated by the store position indicator; and updating the store position indicator in the first cache line.

11. The method of claim 10, where updating the store position indicator comprises: reading a data element size from a designated location in the first cache line; and modifying the store position indicator dependent upon the data element size.

12. The method of claim 9, where the first cache line includes a plurality of error correction code (ECC) bits associated with data locations in the first cache line and where storing, by the producer processing device, a data element of the one or more data elements in the first cache line of the producer processing device comprises: storing the data element at a first location in the first cache line indicated by the store position indicator; and updating an ECC bit, of the plurality of ECC bits, associated with the first location to indicate that data at the first location is valid.

13. The method of claim 9, where loading, by the consumer processing device, the one or more data elements from the second cache line comprises: reading a load position indicator from a designated location in the second cache line; loading the data element at a location in the second cache line indicated by the load position indicator; and updating the load position indicator in the second cache line.

14. The method of claim 9, where storing, by the producer processing device, one or more data elements in the first cache line of the producer processing device comprises: translating the first virtual address to a first intermediate address in a storage device of the data processing system; and identifying the first cache line from the first intermediate address.

15. The method of claim 9, further comprising: allocating, by the link controller, a producer handle comprising a pseudo-address for enabling the producer processing device to reference the virtual link buffer; allocating, by the link controller, a consumer handle comprising a pseudo-address for enabling the consumer processing device to reference the virtual link buffer; and associating, by the link controller, the producer handle with the consumer handle.

16. The method of claim 9, where transferring the one or more data elements in the first cache line of the producer processing device to the second cache line in the cache of the consumer processing device comprises: transferring the first cache line to the link controller; storing, by the link controller, the first cache line in a line buffer in a memory of the data processing system; and transferring the first cache line from the line buffer to the second cache line of the consumer processing device.

17. The method of claim 16, where the line buffer comprises a buffer selected from the group of buffers consisting of a first-in, first-out line buffer, a first-in, last-out line buffer and an unordered buffer.
18. The method of claim 16, further comprising the link controller maintaining an order of lines stored in the line buffer by accessing a memory, where the memory contains one or more of a table, a head pointer, a tail pointer or a linked list.

19. The method of claim 9, where transferring the one or more data elements in the first cache line to the second cache line comprises the link controller: receiving a request from the consumer processing device for data associated with the second virtual address allocated to the virtual link buffer; determining if one or more cache lines associated with the virtual link buffer are stored in the line buffer; and, when one or more cache lines associated with the virtual link buffer are stored in the line buffer: selecting a cache line of the one or more stored cache lines; and transferring one or more data elements in the selected cache line from the line buffer to the consumer processing device.

20. The method of claim 9, where transferring the one or more data elements in the first cache line to the second cache line comprises the link controller: receiving a request from the consumer processing device for data associated with the second virtual address allocated to the virtual link buffer; identifying the first virtual address allocated to the virtual link buffer; requesting a cache line associated with the identified first virtual address from the producer processing device; and transferring a cache line received from the producer processing device to the consumer processing device.

21. The method of claim 9, further comprising: reading a store position indicator from a designated location in the first cache line; and determining from the store position indicator if the first cache line is full; where transferring the one or more data elements in the first cache line of the producer processing device to a second cache line in the cache of the consumer processing device comprises: transferring the first cache line to the link controller when the first cache line is full.

22. A method for providing a virtual link buffer between a producer processing device and a consumer processing device in a data processing system, the method comprising: allocating a first virtual address for identifying a cache line in a cache of the producer processing device; allocating a second virtual address for identifying a cache line in a cache of the consumer processing device, where the second virtual address is associated with the first virtual address; receiving a first cache line associated with the first virtual address from the producer processing device, the first cache line comprising one or more data elements stored by the producer processing device, the first cache line identified by the first virtual address; determining the second virtual address associated with the first virtual address; and transferring the first cache line with the second virtual address to the consumer processing device.

23. The method of claim 22, where determining the second virtual address associated with the first virtual address comprises accessing a table of virtual address pairs.

24. The method of claim 22, where determining the second virtual address associated with the first virtual address comprises accessing a second virtual address encoded in the first cache line.

25. The method of claim 22, where transferring the first cache line with the second virtual address to the consumer processing device comprises buffering the first cache line in a memory.